This document is a data science report on the Kaggle House Prices tutorial project. It was generated using the Shapash library.
Version : 0.7
Name : House Prices Prediction Project
Purpose : Predicting the sale price of houses
Date : 2021-10-26
Contributors : Yann Golhen, Sebastien Bidault, Thomas Bouche, Guillaume Vignal, Thibaud Real
Description : This work is a data science project that tries to predict the sale price of houses based on 79 explanatory variables. It was designed within the data science team at X. and has been improved since the beginning of the project in 2019. The model has been in production since February 2021.
Source Code : https://github.com/MAIF/shapash/tree/master/tutorial
Git Commit : 1ff46e83beafba8949a7f3b7de27586acd6ae99e
Origin : The Assessor’s Office
Description : the sale of individual residential property in Ames, Iowa
Depth : from 2006 to 2010
Perimeter : only residential sales
Target Variable : SalePrice
Target Description : The property's sale price in dollars
Variable Filtering : All variables that required special knowledge or previous calculations for their use were removed
Individual Filtering : only the most recent sales data on any property were kept (for houses that were sold multiple times during this period)
Missing Values : were replaced by 0
Feature Engineering : No feature was created. All features are taken directly from the Kaggle dataset. Categorical features were transformed using an ordinal encoder.
Path To Script : https://github.com/MAIF/shapash/tree/master/tutorial/
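The preparation steps listed above (ordinal encoding of categorical features, missing values replaced by 0) can be sketched as follows. This is a minimal illustration, not the project's script: the toy values are hypothetical, the use of scikit-learn's `OrdinalEncoder` is an assumption (the report does not say which encoder implementation was used), and the order of the two steps is also assumed.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for the Kaggle data (hypothetical values).
df = pd.DataFrame({
    "LotArea": [8450.0, None, 11250.0],
    "KitchenQual": ["Gd", "TA", "Gd"],
    "CentralAir": ["Y", "Y", "N"],
})

# Categorical columns are listed by hand here to keep the sketch explicit.
cat_cols = ["KitchenQual", "CentralAir"]

# Map each category to an integer code (assumption: scikit-learn's encoder,
# which orders the categories of each column alphabetically).
encoder = OrdinalEncoder()
df[cat_cols] = encoder.fit_transform(df[cat_cols])

# Replace the remaining missing values by 0, as stated in the report.
df = df.fillna(0)

print(df)
```

Note that an ordinal encoder assigns arbitrary integer codes unless the category order is supplied, which tree-based models tolerate better than linear ones.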
Used Algorithm : We used a RandomForestRegressor (scikit-learn). This model could be challenged with other models such as XGBRegressor or neural networks.
Parameters Choice : We did not perform any hyperparameter optimisation and chose n_estimators=50. Future work should include a grid search over the hyperparameters.
Metrics : Mean squared error (MSE)
Validation Strategy : We split our data into train (75%) and test (25%) sets
Path To Script : https://github.com/MAIF/shapash/tree/master/tutorial/
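The training setup described above (75%/25% split, a random forest with 50 trees, evaluation with mean squared error) could look roughly like the sketch below. The synthetic data and the fixed `random_state=0` are assumptions made so the example is reproducible; the report itself lists `random_state` as None.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the encoded house features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 75% of the rows for training, 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)

# n_estimators=50 as in the report; every other parameter is left at its default.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Mean squared error on the held-out test set.
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {mse:.4f}")
```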
Model used : RandomForestRegressor
Library : sklearn.ensemble._forest
Library version : 0.24.1
Model parameters :
| Parameter key | Parameter value |
|---|---|
| base_estimator | DecisionTreeRegressor() |
| n_estimators | 50 |
| estimator_params | ('criterion', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'min_weight_fraction_leaf', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'random_state', 'ccp_alpha') |
| bootstrap | True |
| oob_score | False |
| n_jobs | None |
| random_state | None |
| verbose | 0 |
| warm_start | False |
| class_weight | None |
| max_samples | None |
| criterion | mse |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | auto |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| min_impurity_split | None |
| ccp_alpha | 0.0 |
| n_features_in_ | 72 |
| n_features_ | 72 |
| n_outputs_ | 1 |
| base_estimator_ | DecisionTreeRegressor() |
| estimators_ | [DecisionTreeRegressor(max_features='auto', random_state=662305423), DecisionTreeRegressor(max_features='auto', random_state=661015781), DecisionTreeRegressor(max_features='auto', random_state=1578391283), DecisionTreeRegressor(max_features='auto', random_state=1906048284),... |
| | Training dataset | Prediction dataset |
|---|---|---|
| number of features | 72 | 72 |
| number of observations | 1,095 | 365 |
| missing values | 0 | 0 |
| % missing values | 0 | 0 |
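An overview table of this kind can be reproduced with a few pandas calls. The sketch below is an illustration only: the `overview` helper and the toy frames are hypothetical, and this is not presented as how Shapash builds its report.

```python
import pandas as pd

def overview(df: pd.DataFrame) -> dict:
    """Summarise a dataset with the same four rows as the table above."""
    n_missing = int(df.isna().sum().sum())
    return {
        "number of features": df.shape[1],
        "number of observations": df.shape[0],
        "missing values": n_missing,
        "% missing values": 100 * n_missing / df.size,
    }

# Toy frames standing in for the training and prediction datasets.
x_train = pd.DataFrame({"LotArea": [8450, 9600, 11250], "YrSold": [2008, 2007, 2008]})
x_test = pd.DataFrame({"LotArea": [9550], "YrSold": [2006]})

# One column per dataset, one row per statistic.
table = pd.DataFrame({
    "Training dataset": overview(x_train),
    "Prediction dataset": overview(x_test),
})
print(table)
```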
$Value of miscellaneous feature
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 51.5 | 19.3 |
| std | 569 | 111 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 0 | 0 |
| max | 15,500 | 1,200 |
Basement full bathrooms
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 3 |
| missing values | 0 | 0 |
Basement half bathrooms
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Bedrooms above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 8 | 7 |
| missing values | 0 | 0 |
Building Class
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 15 | 15 |
| missing values | 0 | 0 |
Central air conditioning
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |
Condition of sale
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Electrical system
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 4 |
| missing values | 0 | 0 |
Enclosed porch area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 23 | 18.8 |
| std | 63.2 | 54.5 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 0 | 0 |
| max | 552 | 272 |
Exterior covering on house
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 14 | 12 |
| missing values | 0 | 0 |
Exterior materials' condition
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 3 |
| missing values | 0 | 0 |
Exterior materials' quality
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
First Floor square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 1,180 | 1,120 |
| std | 400 | 341 |
| min | 334 | 483 |
| 25% | 886 | 864 |
| 50% | 1,100 | 1,050 |
| 75% | 1,420 | 1,320 |
| max | 4,690 | 2,630 |
Flatness of the property
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Full bathrooms above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Garage condition
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 4 |
| missing values | 0 | 0 |
Garage location
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Garage quality
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 4 |
| missing values | 0 | 0 |
General condition of the basement
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
General shape of property
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
General zoning classification
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 5 |
| missing values | 0 | 0 |
Ground living area square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 1,520 | 1,490 |
| std | 530 | 512 |
| min | 334 | 605 |
| 25% | 1,130 | 1,130 |
| 50% | 1,470 | 1,460 |
| 75% | 1,790 | 1,730 |
| max | 5,640 | 4,480 |
Half baths above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Heating quality and condition
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 5 |
| missing values | 0 | 0 |
Height of the basement
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Home functionality
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 7 | 5 |
| missing values | 0 | 0 |
Interior finish of the garage?
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Kitchen quality
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Kitchens above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Lot configuration
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 5 |
| missing values | 0 | 0 |
Lot size square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 10,600 | 10,300 |
| std | 10,200 | 9,410 |
| min | 1,300 | 1,600 |
| 25% | 7,500 | 7,740 |
| 50% | 9,500 | 9,300 |
| 75% | 11,600 | 11,500 |
| max | 215,000 | 165,000 |
Low quality finished square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 5.18 | 7.83 |
| std | 46.4 | 54.7 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 0 | 0 |
| max | 572 | 513 |
Masonry veneer area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 101 | 109 |
| std | 173 | 203 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 168 | 145 |
| max | 1,600 | 1,380 |
Masonry veneer type
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Month Sold
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 12 | 12 |
| missing values | 0 | 0 |
Number of fireplaces
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Open porch area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 46.9 | 46.1 |
| std | 67.6 | 62.1 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 26 | 24 |
| 75% | 66 | 72 |
| max | 547 | 341 |
Original construction date
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 1,970 | 1,970 |
| std | 30.3 | 29.8 |
| min | 1,870 | 1,880 |
| 25% | 1,950 | 1,950 |
| 50% | 1,970 | 1,970 |
| 75% | 2,000 | 2,000 |
| max | 2,010 | 2,010 |
Other exterior covering on house
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 16 | 14 |
| missing values | 0 | 0 |
Overall condition of the house
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 8 | 9 |
| missing values | 0 | 0 |
Overall material and finish of the house
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 10 | 10 |
| missing values | 0 | 0 |
Paved driveway
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Physical locations within Ames city limits
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 25 | 25 |
| missing values | 0 | 0 |
Pool area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 3 |
| missing values | 0 | 0 |
Proximity to other various conditions
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 7 | 3 |
| missing values | 0 | 0 |
Proximity to various conditions
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 9 | 8 |
| missing values | 0 | 0 |
Rating of basement finished area
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Rating of basement finished area (if present)
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Refers to walkout or garden level walls
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 4 | 4 |
| missing values | 0 | 0 |
Remodel date
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 1,990 | 1,980 |
| std | 20.6 | 20.7 |
| min | 1,950 | 1,950 |
| 25% | 1,970 | 1,960 |
| 50% | 1,990 | 1,990 |
| 75% | 2,000 | 2,000 |
| max | 2,010 | 2,010 |
Roof material
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 7 | 5 |
| missing values | 0 | 0 |
Screen porch area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 16.4 | 11 |
| std | 58 | 48.3 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 0 | 0 |
| max | 480 | 396 |
Second floor square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 342 | 362 |
| std | 435 | 442 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 728 | 728 |
| max | 1,870 | 2,060 |
Size of garage in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 475 | 466 |
| std | 211 | 222 |
| min | 0 | 0 |
| 25% | 329 | 336 |
| 50% | 480 | 466 |
| 75% | 576 | 576 |
| max | 1,420 | 1,390 |
Slope of property
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 3 | 3 |
| missing values | 0 | 0 |
Style of dwelling
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 8 | 8 |
| missing values | 0 | 0 |
Three season porch area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 3.72 | 2.48 |
| std | 31.6 | 21.1 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 0 | 0 |
| max | 508 | 245 |
Total rooms above grade
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 12 | 10 |
| missing values | 0 | 0 |
Total square feet of basement area
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 1,070 | 1,030 |
| std | 453 | 392 |
| min | 0 | 0 |
| 25% | 799 | 780 |
| 50% | 996 | 972 |
| 75% | 1,320 | 1,240 |
| max | 6,110 | 2,630 |
Type 1 finished square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 453 | 416 |
| std | 465 | 429 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 399 | 350 |
| 75% | 722 | 678 |
| max | 5,640 | 2,100 |
Type 2 finished square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 44.7 | 52 |
| std | 162 | 159 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 0 | 0 |
| max | 1,470 | 1,060 |
Type of dwelling
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 5 |
| missing values | 0 | 0 |
Type of foundation
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 5 |
| missing values | 0 | 0 |
Type of heating
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 3 |
| missing values | 0 | 0 |
Type of road access
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 2 |
| missing values | 0 | 0 |
Type of roof
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 6 | 6 |
| missing values | 0 | 0 |
Type of sale
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 9 | 8 |
| missing values | 0 | 0 |
Type of utilities available
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 2 | 1 |
| missing values | 0 | 0 |
Unfinished square feet of basement area
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 570 | 558 |
| std | 446 | 429 |
| min | 0 | 0 |
| 25% | 224 | 217 |
| 50% | 483 | 464 |
| 75% | 812 | 796 |
| max | 2,340 | 2,040 |
Wood deck area in square feet
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 96 | 89 |
| std | 124 | 131 |
| min | 0 | 0 |
| 25% | 0 | 0 |
| 50% | 0 | 0 |
| 75% | 168 | 164 |
| max | 736 | 857 |
Year Sold
| | Training dataset | Prediction dataset |
|---|---|---|
| distinct values | 5 | 5 |
| missing values | 0 | 0 |
Year garage was built
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 1,980 | 1,980 |
| std | 26.6 | 25.5 |
| min | 1,870 | 1,900 |
| 25% | 1,960 | 1,960 |
| 50% | 1,980 | 1,980 |
| 75% | 2,000 | 2,000 |
| max | 2,010 | 2,010 |
SalePrice (target variable)
| | Training dataset | Prediction dataset |
|---|---|---|
| count | 1,095 | 365 |
| mean | 182,000 | 177,000 |
| std | 78,500 | 82,000 |
| min | 34,900 | 40,000 |
| 25% | 130,000 | 126,000 |
| 50% | 165,000 | 160,000 |
| 75% | 215,000 | 205,000 |
| max | 755,000 | 745,000 |
Note : the explainability graphs were generated using the test set only.
[Figures: per-feature explainability plots, one for each of the 72 features listed in the dataset analysis above, from "$Value of miscellaneous feature" to "Year garage was built".]
| | True values | Prediction values |
|---|---|---|
| count | 365 | 365 |
| mean | 177,000 | 177,000 |
| std | 82,000 | 70,500 |
| min | 40,000 | 66,100 |
| 25% | 126,000 | 128,000 |
| 50% | 160,000 | 157,000 |
| 75% | 205,000 | 200,000 |
| max | 745,000 | 524,000 |
Mean absolute error : 16,100
Mean squared error : 626,000,000
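Because the mean squared error is expressed in squared dollars, its square root (RMSE) is easier to compare with the mean absolute error. The sketch below computes the three quantities on hypothetical prices and then takes the square root of the rounded MSE reported above, which gives roughly 25,000 dollars. The `regression_errors` helper is illustrative and not part of the project.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two arrays of equal length."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mae = float(np.mean(np.abs(residuals)))
    mse = float(np.mean(residuals ** 2))
    return mae, mse, mse ** 0.5

# Hypothetical sale prices and predictions, in dollars.
y_true = [200_000, 150_000, 300_000]
y_pred = [210_000, 140_000, 330_000]
mae, mse, rmse = regression_errors(y_true, y_pred)
print(f"MAE={mae:,.0f}  MSE={mse:,.0f}  RMSE={rmse:,.0f}")

# Square root of the rounded MSE reported above (626,000,000).
reported_rmse = 626_000_000 ** 0.5
print(f"RMSE from the reported MSE: {reported_rmse:,.0f}")
```

An RMSE noticeably larger than the MAE, as here, indicates that a few large errors weigh heavily on the squared metric.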